Cellen 2

Data wrangling III — pivoting

Gavin Simpson

Aarhus University

2025-02-17

Learning objectives

In this video of the data wrangling topic you will

  • Learn how to convert data between wide and long formats using the tidyr 📦

Pivoting

Pivoting

Converting between a wide and long representations is called pivoting

gather() & spread()

An earlier implementation in tidyr 📦 used

  • gather()
  • spread()

roughly corresponding to

  • pivot_longer()
  • pivot_wider()

Many examples online and in 1st edition of r4ds used gather() and spread() but the pivot_*() functions are to be preferred now

pivot_longer()

How does pivoting data to a longer format work?

Say we have a data set with three dogs (A, B, and C) and we have taken their blood pressure twice each

dogs <- tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)

If we want to plot or work with the BP measurements, we need these in a single column (variable)

pivot_longer()

dogs
## # A tibble: 3 × 3
##   id      bp1   bp2
##   <chr> <dbl> <dbl>
## 1 A       100   120
## 2 B       140   115
## 3 C       120   125

Our aim is to have a data frame with three variables

  1. dog
  2. measurement
  3. value

We achieve this with pivot_longer()

pivot_longer()

dogs |> 
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
## # A tibble: 6 × 3
##   id    measurement value
##   <chr> <chr>       <dbl>
## 1 A     bp1           100
## 2 A     bp2           120
## 3 B     bp1           140
## 4 B     bp2           115
## 5 C     bp1           120
## 6 C     bp2           125

pivot_longer()

The id column is already a variable

It needs to be repeated once per column that we are pivoting

Here we are pivoting two columns, so we repeat each id twice

pivot_longer()

The column names from dogs:

  • bp1
  • bp2

become the values in a new column, whose name was give by the names_to argument: "measurement"

The original column names need to be repeated once per row of the original data

pivot_longer()

The cell values in the original data also become a new variable

The name of this new variable is given by the values_to argument: "value"

They are unwound, row by row

pivot_longer()

dogs |> 
  pivot_longer(
    cols = bp1:bp2,           # which columns are we pivoting?
    names_to = "measurement", # variable to create in long form
    values_to = "value"       # variable to hold the cell values
  )
## # A tibble: 6 × 3
##   id    measurement value
##   <chr> <chr>       <dbl>
## 1 A     bp1           100
## 2 A     bp2           120
## 3 B     bp1           140
## 4 B     bp2           115
## 5 C     bp1           120
## 6 C     bp2           125

More complicated settings

Sometimes you will have data like this, with multiple variables per column

## # A tibble: 3 × 679
##   country  h_1996_Jan c_1996_Jan h_1996_Feb c_1996_Feb h_1996_Mar c_1996_Mar
##   <chr>         <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
## 1 England       91024        355      90735        392      90512        440
## 2 Wales         21296        152      21256        164      21195        182
## 3 Scotland      20080         32      20034         31      19989         31
## # ℹ 672 more variables: h_1996_Apr <dbl>, c_1996_Apr <dbl>, h_1996_May <dbl>,
## #   c_1996_May <dbl>, h_1996_Jun <dbl>, c_1996_Jun <dbl>, h_1996_Jul <dbl>,
## #   c_1996_Jul <dbl>, h_1996_Aug <dbl>, c_1996_Aug <dbl>, h_1996_Sep <dbl>,
## #   c_1996_Sep <dbl>, h_1996_Oct <dbl>, c_1996_Oct <dbl>, h_1996_Nov <dbl>,
## #   c_1996_Nov <dbl>, h_1996_Dec <dbl>, c_1996_Dec <dbl>, h_1997_Jan <dbl>,
## #   c_1997_Jan <dbl>, h_1997_Feb <dbl>, c_1997_Feb <dbl>, h_1997_Mar <dbl>,
## #   c_1997_Mar <dbl>, h_1997_Apr <dbl>, c_1997_Apr <dbl>, h_1997_May <dbl>, …

We have variables named h_1996_Jan and c_1996_Feb, where

  • h indicates if the data are number of herds
  • c indicates if the data are number of cases

pivot_longer()

bovine_long <- bovine_wide |> 
  pivot_longer(
    cols = !country,                       # which columns are we pivoting?
    names_to = c("type", "year", "month"), # variables to create in long form
    names_sep = "_",                       # split names at _
    values_to = "n"                        # variable to hold the cell values
  )
bovine_long
## # A tibble: 2,034 × 5
##    country type  year  month     n
##    <chr>   <chr> <chr> <chr> <dbl>
##  1 England h     1996  Jan   91024
##  2 England c     1996  Jan     355
##  3 England h     1996  Feb   90735
##  4 England c     1996  Feb     392
##  5 England h     1996  Mar   90512
##  6 England c     1996  Mar     440
##  7 England h     1996  Apr   90210
##  8 England c     1996  Apr     454
##  9 England h     1996  May   89357
## 10 England c     1996  May     434
## # ℹ 2,024 more rows

pivot_longer()

Conceptually this is not too different from what we already saw

Instead of the columns names pivoting into a single column, they pivot to multiple columns

pivot_wider()

Having now separated the data into a long format we should see that there are multiple variables in the type column

bovine_long |>
  distinct(type)
## # A tibble: 2 × 1
##   type 
##   <chr>
## 1 h    
## 2 c

The observations are at the level of country, by year and month

But the data for each observation are spread over two rows

pivot_wider()

We can use pivot_wider() to pivot the data to a wider format to create n_herds and n_cases columns

bovine_long |>
  pivot_wider(
    id_cols = c(country, year, month), # which columns are we *not* pivoting
    names_from = type,                 # column with names of new variables
    values_from = n                    # column with values for new variables
  ) |>
  rename(n_herds = h, n_cases = c)     # rename for nicer names
## # A tibble: 1,017 × 5
##    country year  month n_herds n_cases
##    <chr>   <chr> <chr>   <dbl>   <dbl>
##  1 England 1996  Jan     91024     355
##  2 England 1996  Feb     90735     392
##  3 England 1996  Mar     90512     440
##  4 England 1996  Apr     90210     454
##  5 England 1996  May     89357     434
##  6 England 1996  Jun     88554     403
##  7 England 1996  Jul     87802     387
##  8 England 1996  Aug     87356     379
##  9 England 1996  Sep     86879     313
## 10 England 1996  Oct     86065     274
## # ℹ 1,007 more rows

pivot_wider()

To see how pivot_wider() works, we’ll revisit the small dog blood pressure data set, but this time we have only two dogs

dogs2 <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115, 
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

We want to create a data frame with column names taken from "measurement" and fill the cells with the values from "value"

pivot_wider()

We want to create a data frame with columns names taken from "measurement" and fill the cells with the values from "value"

dogs2 |> 
  pivot_wider(
    names_from = measurement, # which column(s) to take the names of new columns from
    values_from = value       # which column to take the values for the cells from
  )
## # A tibble: 2 × 4
##   id      bp1   bp2   bp3
##   <chr> <dbl> <dbl> <dbl>
## 1 A       100   120   105
## 2 B       140   115    NA

How pivot_wider() works — I

First, pivot_wider() has to identify what the new columns will be

dogs2 |> 
  distinct(measurement) |> 
  pull()
## [1] "bp1" "bp2" "bp3"

How pivot_wider() works — II

By default, the rows in the output are determined from those variables not going to new variables (names) or values

These are the id_cols and should uniquely identify a row in the new (wider) data frame

Here we only have 1 ID column

dogs2 |> 
  select(-measurement, -value) |> 
  distinct()
## # A tibble: 2 × 1
##   id   
##   <chr>
## 1 A    
## 2 B

How pivot_wider() works — III

pivot_wider() then uses these combinations to create an empty data frame

dogs2 |> 
  select(-measurement, -value) |> 
  distinct() |> 
  mutate(bp1 = NA, bp2 = NA, bp3 = NA)
## # A tibble: 2 × 4
##   id    bp1   bp2   bp3  
##   <chr> <lgl> <lgl> <lgl>
## 1 A     NA    NA    NA   
## 2 B     NA    NA    NA

Where bp1 etc were the distinct values of measurement

How pivot_wider() works — IV

pivot_wider() then fills in the new data frame with the relevant values

Not every dog had three BP measurements, so there is no value to fill in the data frame with

In that case the value is NA (missing)

dogs2 |> 
  pivot_wider(
    names_from = measurement, # which column(s) to take the names of new columns from
    values_from = value       # which column to take the values for the cells from
  )
## # A tibble: 2 × 4
##   id      bp1   bp2   bp3
##   <chr> <dbl> <dbl> <dbl>
## 1 A       100   120   105
## 2 B       140   115    NA